A BIAS CORRECTION FOR THE MINIMUM ERROR RATE IN CROSS-VALIDATION

By Ryan J. Tibshirani and Robert Tibshirani
Stanford University and Stanford University 

Tuning parameters in supervised learning problems are often esti- 
mated by cross-validation. The minimum value of the cross-validation 
error can be biased downward as an estimate of the test error at that 
same value of the tuning parameter. We propose a simple method 
for the estimation of this bias that uses information from the cross- 
validation process. As a result, it requires essentially no additional 
computation. We apply our bias estimate to a number of popular 
classifiers in various settings, and examine its performance. 

1. Introduction. Cross-validation is widely used in regression and classi- 
fication problems to choose the value of a "tuning parameter" in a prediction 
model. By training and testing the model on separate subsets of the data, 
we get an idea of the model's prediction strength as a function of the tuning 
parameter, and we choose the parameter value to minimize the CV error 
curve. This estimate admits many nice properties [see Stone (1977) for a 
discussion of asymptotic consistency and efficiency] and works well in prac- 
tice. 

However, the minimum CV error itself tends to be too optimistic as an
estimate of true prediction error. Many have noticed this downward bias
in the minimum error rate. Breiman et al. (1984) acknowledge this bias
in the context of classification and regression trees. Efron (2008) discusses
this problem in the setting $p \gg n$, and employs an empirical Bayes method,
which does not involve cross-validation in the choice of tuning parameters,
to avoid such a bias. However, the proposed algorithm requires an initial
choice for a "target error rate," which complicates matters by introducing
another tuning parameter. Varma and Simon (2006) suggest a method using
"nested" cross-validation to estimate the true error rate. This essentially
amounts to doing a cross-validation procedure for every data point, and is
hence impractical in settings where cross-validation is computationally
expensive.

We propose a bias correction for the minimum CV error rate in K-fold
cross-validation. It is computed directly from the individual error curves 
from each fold and, hence, does not require a significant amount of additional 
computation. 

Figure 1 shows an example. The data come from the laboratory of Dr. Pat 
Brown of Stanford, consisting of gene expression measurements over 4718 
genes on 128 patient samples, 88 from healthy tissues and 40 from CNS 
tumors. We randomly divided the data in half, into training and test samples, 
and applied the nearest shrunken centroids classifier [Tibshirani et al. (2001)]
with 10-fold cross-validation, using the pamr package in the R language. The 
figure shows the CV curve, with its minimum at 23 genes, achieving a CV 
error rate of 4.7%. The test error at 23 genes is 8%. The estimate of the CV 
bias, using the method described in this paper, is 2.7%, yielding an adjusted 
error of 4.7 + 2.7 = 7.4%. Over 100 repeats of this experiment, the average 
test error was 7.8%, and the average adjusted CV error was 7.3%. 

In this paper we study the CV bias problem and examine the accuracy of 
our proposed adjustment on simulated data. These examples suggest that 
the bias is larger when the signal-to-noise ratio is lower, a fact also noted 
by Efron (2008). We also provide a short theoretical section examining the 
expectation of the bias when there is no signal at all. 

2. Model selection using cross-validation. Suppose we observe $n$
independent and identically distributed points $(x_i, y_i)$, where
$x_i = (x_{i1}, \ldots, x_{ip})$ is a vector of predictors, and $y_i$ is a
response (this can be real-valued or discrete). From this "training" data we
estimate a prediction model $\hat f(x)$ for $y$, and we have a loss function
$L(y, \hat f(x))$ that measures the error between $y$ and $\hat f(x)$.
Typically, this is

$$L(y, \hat f(x)) = (y - \hat f(x))^2 \qquad \text{(squared error)}$$

for regression, and

$$L(y, \hat f(x)) = 1\{y \neq \hat f(x)\} \qquad \text{(0-1 loss)}$$

for classification.

An important quantity is the expected prediction error
$E[L(y_0, \hat f(x_0))]$ (also called expected test error). This is the
expected value of the loss when predicting an independent data point
$(x_0, y_0)$, drawn from the same distribution as our training data. The
expectation is over all that is random [namely, the model $\hat f$ and the
test point $(x_0, y_0)$].

Fig. 1. Brown microarray cancer data: the CV error curve is minimized at 23 genes,
achieving a CV error of 0.047. Meanwhile, the test error at 23 genes is 0.08, drawn
as a dashed line. The proposed bias estimate is 0.027, giving an adjusted error of
0.047 + 0.027 = 0.074, drawn as a dotted line.

Suppose that our prediction model depends on a parameter $\theta$, that is,
$\hat f(x) = \hat f(x, \theta)$. We want to select $\theta$ based on the
training set $(x_i, y_i)$, $i = 1, \ldots, n$, in order to minimize the
expected prediction error.

One of the simplest and most popular methods for doing this is K-fold
cross-validation. We first split our data $(x_i, y_i)$ into $K$ equal parts.
Then for each $k = 1, \ldots, K$, we remove the $k$th part from our data set
and fit a model $\hat f^{-k}(x, \theta)$. Let $C_k$ be the indices of
observations in the $k$th fold. The cross-validation estimate of the expected
test error is

$$\mathrm{CV}(\theta) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k} L\bigl(y_i, \hat f^{-k}(x_i, \theta)\bigr). \tag{1}$$

Recall that $\hat f^{-k}(x_i, \theta)$ is a function of $\theta$, so we
compute $\mathrm{CV}(\theta)$ over a grid of parameter values
$\theta_1, \ldots, \theta_t$, and choose the minimizer $\hat\theta$ to be our
parameter estimate. We call $\mathrm{CV}(\theta)$ the "CV error curve."
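The procedure above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the k-nearest-neighbour classifier, the function names `knn_predict` and `cv_error_curve`, and the toy data are hypothetical choices made here, with the number of neighbours playing the role of $\theta$. The function returns both $\mathrm{CV}(\theta)$ from (1) and the per-fold error curves, since the latter are what a fold-based bias correction needs.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Majority-vote k-nearest-neighbour prediction for 0/1 labels."""
    # Squared Euclidean distance between every test and training point.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest points
    votes = y_train[nearest].mean(axis=1)        # fraction of neighbours in class 1
    return (votes > 0.5).astype(int)


def cv_error_curve(X, y, thetas, K=10, seed=0):
    """Return CV(theta) on the grid, the per-fold error curves, and the folds."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # index sets C_1, ..., C_K
    losses = np.zeros((n, len(thetas)))             # L(y_i, f^{-k}(x_i, theta_j))
    for C_k in folds:
        train = np.setdiff1d(np.arange(n), C_k)     # all observations except fold k
        for j, theta in enumerate(thetas):
            pred = knn_predict(X[train], y[train], X[C_k], theta)
            losses[C_k, j] = (pred != y[C_k])       # 0-1 loss
    cv = losses.mean(axis=0)                        # equation (1)
    fold_errors = np.array([losses[C_k].mean(axis=0) for C_k in folds])
    return cv, fold_errors, folds


# Toy data (an assumption for illustration): two Gaussian classes whose
# means differ by 1 in every feature.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 5)) + y[:, None]
thetas = [1, 3, 5, 7, 9, 11]                 # grid theta_1, ..., theta_t (odd k avoids ties)
cv, fold_errors, folds = cv_error_curve(X, y, thetas, K=10)
theta_hat = thetas[int(np.argmin(cv))]       # minimizer of the CV error curve
```

With equal fold sizes, as here, `cv` is simply the column-wise mean of `fold_errors`.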
3. Bias correction. We would like to estimate the expected test error
using $\hat f(x, \hat\theta)$, namely,

$$\mathrm{Err} = E[L(y_0, \hat f(x_0, \hat\theta))].$$

The naive estimate is $\mathrm{CV}(\hat\theta)$, having bias

$$\mathrm{Bias} = \mathrm{Err} - \mathrm{CV}(\hat\theta). \tag{2}$$

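This is likely to be positive, since $\hat\theta$ was chosen because it
minimizes $\mathrm{CV}(\theta)$. Let $n_k$ be the number of observations in
the $k$th fold, and define

$$e_k(\theta) = \frac{1}{n_k} \sum_{i \in C_k} L\bigl(y_i, \hat f^{-k}(x_i, \theta)\bigr).$$

This is the error curve computed from the predictions in the $k$th fold.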

Our estimate uses the difference between the value of $e_k$ at $\hat\theta$
and its minimum to mimic the bias in cross-validation. Specifically, we
propose the following estimate:

$$\widehat{\mathrm{Bias}} = \frac{1}{K} \sum_{k=1}^{K} \bigl[e_k(\hat\theta) - e_k(\hat\theta_k)\bigr], \tag{3}$$

where $\hat\theta_k$ is the minimizer of $e_k(\theta)$. Note that this
estimate uses only quantities that have already been computed for the CV
estimate (1), and requires no new model fitting. Since
$\widehat{\mathrm{Bias}}$ is a mean over $K$ folds, we can also use the
standard error of the mean as an approximate estimate for its standard
deviation.

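The adjusted estimate of test error is
$\mathrm{CV}(\hat\theta) + \widehat{\mathrm{Bias}}$. Note that if the fold
sizes are equal, then
$\mathrm{CV}(\hat\theta) = \frac{1}{K} \sum_{k=1}^{K} e_k(\hat\theta)$, so the
first term in (3) averages to $\mathrm{CV}(\hat\theta)$ and the adjusted
estimate of test error is

$$\mathrm{CV}(\hat\theta) + \widehat{\mathrm{Bias}} = 2\,\mathrm{CV}(\hat\theta) - \frac{1}{K} \sum_{k=1}^{K} e_k(\hat\theta_k).$$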
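As an illustration, the bias estimate (3), its standard error and the adjusted error can be computed from a $K \times t$ array of per-fold error curves. The sketch below is a minimal Python/NumPy version; the function name `cv_bias_correction`, the dictionary keys and the small example array are assumptions made here, not part of the paper.

```python
import numpy as np


def cv_bias_correction(fold_errors, fold_sizes=None):
    """Bias estimate (3) from per-fold error curves.

    fold_errors[k, j] holds e_k(theta_j); fold_sizes holds n_k
    (equal fold sizes are assumed when it is omitted).
    """
    fold_errors = np.asarray(fold_errors, dtype=float)
    K = fold_errors.shape[0]
    if fold_sizes is None:
        fold_sizes = np.ones(K)
    w = np.asarray(fold_sizes, dtype=float) / np.sum(fold_sizes)
    cv = w @ fold_errors                         # CV(theta_j), equation (1)
    j_hat = int(np.argmin(cv))                   # grid index of theta-hat
    # e_k(theta-hat) - e_k(theta-hat_k): each fold's error at the overall
    # minimizer minus that fold's own minimum.
    gaps = fold_errors[:, j_hat] - fold_errors.min(axis=1)
    bias = gaps.mean()                           # equation (3)
    se = gaps.std(ddof=1) / np.sqrt(K)           # standard error of the mean
    return {"j_hat": j_hat, "cv_min": cv[j_hat], "bias": bias,
            "se": se, "adjusted": cv[j_hat] + bias}


# Made-up error curves for K = 3 folds on a grid of t = 3 parameter values.
example = np.array([[0.30, 0.20, 0.25],
                    [0.10, 0.20, 0.30],
                    [0.25, 0.15, 0.10]])
result = cv_bias_correction(example)
```

Each term of the sum is nonnegative by construction, so the returned `bias` can never be negative; with equal fold sizes, `adjusted` agrees with the closed form $2\,\mathrm{CV}(\hat\theta) - \frac{1}{K}\sum_k e_k(\hat\theta_k)$.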

The intuitive motivation for the estimate $\widehat{\mathrm{Bias}}$ is as
follows: first, $e_k(\hat\theta_k) \approx \mathrm{CV}(\hat\theta)$ since both
are error curves evaluated at their minima; the latter uses all $K$ folds,
while the former uses just fold $k$. Second, for fixed $\theta$,
cross-validation error estimates the expected test error, so that
$e_k(\theta) \approx E[L(y, \hat f(x, \theta))]$. Thus,
$e_k(\hat\theta) \approx \mathrm{Err}$.

The second analogy is not perfect: $\mathrm{Err} = E[L(y, \hat f(x, \hat\theta))]$,
where $(x, y)$ is stochastically independent of the training data, and hence
of $\hat\theta$. In contrast, the terms in $e_k(\hat\theta)$ are
$L(y_i, \hat f^{-k}(x_i, \hat\theta))$, $i \in C_k$; here $(x_i, y_i)$ has
some dependence on $\hat\theta$ since $\hat\theta$ is chosen to minimize the
validation error across all folds, including the $k$th one. To remove this
dependence, one would have to carry out a new cross-validation for each of
the $K$ original folds, which is much more computationally expensive.

There is a similarity between the bias estimate in (3) and bootstrap
estimates of bias in Efron (1979) and Efron and Tibshirani (1993). Suppose
that we have data $z = (z_1, z_2, \ldots, z_n)$ and a statistic $s(z)$. Let
$z^{*1}, z^{*2}, \ldots, z^{*B}$ be bootstrap samples each of size $n$ drawn
with replacement from $z$. Then the bootstrap estimate of bias is

$$\widehat{\mathrm{Bias}}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \bigl[s(z^{*b}) - s(z)\bigr]. \tag{4}$$

Suppose that $s(z)$ is a functional statistic and hence can be written as
$t(\hat F)$, where $\hat F$ is the empirical distribution function. Then
$\widehat{\mathrm{Bias}}_{\mathrm{boot}}$ approximates
$E_F[t(\hat F)] - t(F)$, the expected bias in the original statistic as an
estimate of the true parameter $t(F)$.
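For concreteness, the generic bootstrap bias estimate (4) can be sketched as follows in Python with NumPy. The function name `bootstrap_bias` and the choice of the plug-in variance as the statistic $s(z)$ are illustrative assumptions; the plug-in variance is used only because it is a familiar example of a downward-biased functional statistic.

```python
import numpy as np


def bootstrap_bias(z, statistic, B=200, seed=0):
    """Bootstrap estimate of bias, equation (4), for a statistic s(z)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    n = len(z)
    s_obs = statistic(z)                         # s(z) on the original data
    reps = np.empty(B)
    for b in range(B):
        z_star = z[rng.integers(0, n, size=n)]   # sample n points with replacement
        reps[b] = statistic(z_star)              # s(z^{*b})
    return reps.mean() - s_obs                   # (1/B) sum_b [s(z^{*b}) - s(z)]


# Example: the plug-in variance (np.var divides by n) underestimates the
# true variance, so its bootstrap bias estimate should come out negative.
rng = np.random.default_rng(42)
z = rng.standard_normal(20)
bias = bootstrap_bias(z, np.var, B=2000)
```

Every bootstrap replicate requires recomputing the statistic, which is cheap here but costly when the statistic is itself a cross-validation run.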

Now to estimate the quantity Bias in (2), we could apply the bootstrap
estimate in (4). This would entail drawing bootstrap samples and computing
a new cross-validation curve from each sample. Then we would compute the
difference between the minimum of the curve and the value of the curve at
the training set minimizer. In detail, let $\mathrm{CV}(z^*, \hat\theta(z))$
be the value of the cross-validation curve computed on the dataset $z^*$ and
evaluated at $\hat\theta(z)$, the minimizer for the CV curve computed on
dataset $z$. Then the bootstrap estimate of bias can be expressed as

$$\frac{1}{B} \sum_{b=1}^{B} \bigl[\mathrm{CV}(z^{*b}, \hat\theta(z)) - \mathrm{CV}(z^{*b}, \hat\theta(z^{*b}))\bigr]. \tag{5}$$

The computation of this estimate is expensive, requiring $B$ K-fold
cross-validations, where $B$ is typically 100 or more. The estimate
$\widehat{\mathrm{Bias}}$ in (3) finesses this by using the original
cross-validation folds to approximate the bias in place of the bootstrap
samples.

In the next section we examine the performance of our estimate in various 
contexts. 

4. Application to simulated data. We carried out a simulation study to
examine the size of the CV bias, and the accuracy of our proposed adjustment
(3). The data were generated as standard Gaussian in two settings:
$p \ll n$ ($n = 400$, $p = 100$) and $p \gg n$ ($n = 40$, $p = 1000$). There
were two classes of equal size. For each of these we created two settings:
"no signal," in which the class labels were independent of the features, and
"signal," where the mean of the first 10% of the features was shifted to be
0.5 units higher in class 2.
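A data generator matching this description might look like the sketch below (Python with NumPy). The function name `simulate_two_class`, its arguments and the seeds are assumptions made for illustration; the paper does not give its simulation code.

```python
import numpy as np


def simulate_two_class(n, p, signal, shift=0.5, frac=0.10, seed=0):
    """Standard Gaussian features with two equal-sized classes labelled 1 and 2.

    When signal is True, the first frac * p features are shifted upward by
    `shift` in class 2; otherwise labels are independent of the features.
    Assumes n is even.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = np.repeat([1, 2], n // 2)            # two classes of equal size
    if signal:
        m = int(round(frac * p))             # number of shifted features
        X[y == 2, :m] += shift
    return X, y


# The two dimension regimes used in the study.
X_sig, y_sig = simulate_two_class(n=400, p=100, signal=True)      # p << n, signal
X_null, y_null = simulate_two_class(n=40, p=1000, signal=False)   # p >> n, no signal
```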

In each of these settings we applied five different classifiers: LDA (linear
discriminant analysis), SVM (linear support vector machines), CART
(classification and regression trees), KNN (K-nearest neighbors), and GBM
(gradient boosting machines). In the $p \gg n$ setting, the LDA solution is
not of full rank, so we used diagonal linear discriminant analysis with
soft-thresholding of the centroids, known as nearest shrunken centroids (NSC).
Table 1 shows the mean of the test error, minimum CV error (using 10-fold
CV), true bias, and estimated bias, over 100 simulations. The standard
errors are given in brackets.

Table 1

Results for proposed bias correction for the minimum CV error, using 10-fold
cross-validation. Shown are mean and standard error over 100 simulations, for five
different classifiers

Setting      Method   Min CV error     Test error       Adjusted CV error

$p \ll n$
No signal    LDA      [illegible]      0.5              [illegible]
             SVM      [illegible]      0.5              [illegible]
             CART     0.474 (0.003)    0.5              0.510 (0.004)
             KNN      0.473 (0.002)    0.5              0.524 (0.003)
             GBM      0.475 (0.003)    0.5              0.520 (0.003)
Signal       LDA      0.290 (0.003)    0.284 (0.001)    0.290 (0.003)
             SVM      0.257 (0.003)    0.260 (0.001)    0.279 (0.003)
             CART     0.356 (0.003)    0.378 (0.002)    0.384 (0.003)
             KNN      0.291 (0.003)    0.284 (0.002)    0.305 (0.004)
             GBM      0.269 (0.002)    0.272 (0.002)    0.288 (0.003)

$p \gg n$
No signal    NSC      0.384 (0.009)    0.5              0.511 (0.012)
             SVM      0.475 (0.009)    0.5              0.498 (0.010)
             CART     0.498 (0.011)    0.5              0.500 (0.011)
             KNN      0.430 (0.007)    0.5              0.577 (0.009)
             GBM      0.432 (0.010)    0.5              0.552 (0.012)
Signal       NSC      0.106 (0.006)    0.136 (0.004)    0.152 (0.008)
             SVM      0.142 (0.007)    0.138 (0.003)    0.157 (0.008)
             CART     0.432 (0.012)    0.432 (0.004)    0.437 (0.012)
             KNN      0.200 (0.007)    0.251 (0.005)    0.297 (0.010)
             GBM      0.233 (0.008)    0.276 (0.006)    0.307 (0.010)

We see that the bias tends to be larger in the "no signal" case, and varies
significantly depending on the classifier. And it seems to be sizable only
when $p \gg n$. The bias adjustment is quite accurate in most cases, except
for the KNN and GBM classifiers when $p \gg n$, when it is too large. With
only 40 observations, 10-fold CV has just four observations in each fold, and
this may cause erratic behavior for these highly nonlinear classifiers. Table 2
shows the results for KNN and GBM when $p \gg n$, with 5-fold CV. Here
the bias estimate is more accurate, but is still slightly too large.
5. Nonnegativity of the bias. Recall Section 3, where we introduced
$\mathrm{Bias} = \mathrm{Err} - \mathrm{CV}(\hat\theta)$, and our estimate
$\widehat{\mathrm{Bias}}$. It follows from the definition that
$\widehat{\mathrm{Bias}} \ge 0$ always. We show that for classification
problems, $E[\mathrm{Bias}] \ge 0$ when there is no signal.

Theorem 1. Suppose that there is no true signal, so that $y_0$ is
stochastically independent of $x_0$. Suppose also that we are in the
classification setting, and $y_0 = 1, \ldots, G$ with equal probability.
Finally suppose that the loss is 0-1,
$L(y, \hat f(x)) = 1\{y \neq \hat f(x)\}$. Then
$E[\mathrm{CV}(\hat\theta)] \le \mathrm{Err}$.

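Proof. The proof is quite straightforward. We have
$\mathrm{Err} = 1 - P(y_0 = \hat f(x_0, \hat\theta))$, where
$\hat f(\cdot, \hat\theta)$ is fit on the training examples
$(x_1, y_1), \ldots, (x_n, y_n)$. Suppose that marginally
$P(\hat f(x_0, \hat\theta) = j) = p_j$, for $j = 1, \ldots, G$. Then, by
independence,

$$P(y_0 = \hat f(x_0, \hat\theta)) = \sum_j P(y_0 = \hat f(x_0, \hat\theta) = j) = \sum_j \frac{1}{G}\, p_j = \frac{1}{G},$$

so $\mathrm{Err} = \frac{G-1}{G}$. By the same argument,
$E[\mathrm{CV}(\theta)] = \frac{G-1}{G}$ for any fixed $\theta$. Therefore,

$$E[\mathrm{CV}(\hat\theta)] = E\Bigl[\min_i \mathrm{CV}(\theta_i)\Bigr] \le E[\mathrm{CV}(\theta_1)] = \frac{G-1}{G},$$

which completes the proof. □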
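The inequality in the proof can be checked numerically. The sketch below (Python with NumPy) makes a simplifying assumption that is not in the paper: in the no-signal setting the out-of-fold predictions for each grid value are modelled as uniform random guesses, independent of the labels. The function name `mean_min_cv_no_signal` and all default values are hypothetical.

```python
import numpy as np


def mean_min_cv_no_signal(n=40, G=2, t=20, reps=2000, seed=0):
    """Monte Carlo averages of min_j CV(theta_j) and of CV(theta_1), no signal.

    Assumption: predictions are uniform random guesses over G classes,
    drawn independently of the labels for each of the t grid values.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, G, size=(reps, 1, n))        # labels, uniform over G classes
    preds = rng.integers(0, G, size=(reps, t, n))    # label-independent predictions
    cv = (preds != y).mean(axis=2)                   # 0-1 CV error, shape (reps, t)
    return cv.min(axis=1).mean(), cv[:, 0].mean()


mean_min, mean_fixed = mean_min_cv_no_signal()
chance = (2 - 1) / 2                                 # (G - 1) / G with G = 2
```

Under these assumptions `mean_fixed` should sit close to `chance`, while `mean_min` falls below it, which is the downward bias of the minimum CV error that the theorem describes.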



Now suppose that there is no signal and we are in the regression setting
with squared error loss, $L(y, \hat f(x)) = (y - \hat f(x))^2$. We conjecture
that indeed $E[\mathrm{CV}(\hat\theta)] \le \mathrm{Err}$ for a fairly general
class of models $\hat f$.

Let $(\tilde x_1, \tilde y_1), \ldots, (\tilde x_n, \tilde y_n)$ denote $n$
test points, independent of the training data and drawn from the same
distribution. Consider doing cross-validation on the test set in order to
determine a value for $\theta$ (just treating the test data as if it were
training data). That is, define

$$\widetilde{\mathrm{CV}}(\theta) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k} \bigl(\tilde y_i - \tilde f^{-k}(\tilde x_i, \theta)\bigr)^2,$$

where $\tilde f^{-k}$ is fit on all test examples $(\tilde x_i, \tilde y_i)$
except those in the $k$th fold. Let $\tilde\theta$ be the minimizer of
$\widetilde{\mathrm{CV}}(\theta)$ over $\theta_1, \ldots, \theta_t$. Then

$$E[\mathrm{CV}(\hat\theta)] = E[\widetilde{\mathrm{CV}}(\tilde\theta)] \le E[\widetilde{\mathrm{CV}}(\hat\theta)],$$

where the first step is true by symmetry, and the second is true by definition
of $\tilde\theta$. But (assuming for notational simplicity that $1 \in C_1$)
$E[\widetilde{\mathrm{CV}}(\hat\theta)] = E[(\tilde y_1 - \tilde f^{-1}(\tilde x_1, \hat\theta))^2]$,
and we conjecture that

$$E\bigl[(\tilde y_1 - \tilde f^{-1}(\tilde x_1, \hat\theta))^2\bigr] = E\bigl[(\tilde y_1 - \hat f^{-1}(\tilde x_1, \hat\theta))^2\bigr]. \tag{6}$$

Intuitively, since there is no signal, $\tilde f^{-1}(\cdot, \hat\theta)$ and
$\hat f^{-1}(\cdot, \hat\theta)$ should predict equally well against a new
example $(\tilde x_1, \tilde y_1)$, because $\hat\theta$ should not have any
real relation to predictive strength.

For example, if we are doing ridge regression with $p = 1$ and $K = n$
(leave-one-out CV), and we assume that each $x_i = \tilde x_i$ is fixed
(nonrandom), then we can write out the model $\hat f^{-k}(\cdot, \theta)$
explicitly. In this case, we can show (6) is equivalent to showing

$$E[y_1 \mid \hat\theta] = E[y_1], \qquad E[y_1^2 \mid \hat\theta] = E[y_1^2] \qquad \text{and} \qquad E[y_1 y_2 \mid \hat\theta] = E[y_1]\,E[y_2].$$

In words, the mean and variance of $y_1$ are unchanged by conditioning on
$\hat\theta$, and $y_1, y_2$ are conditionally independent given $\hat\theta$.
These certainly seem true when looking at simulations, but are hard to prove
rigorously because of the complicated relationship between the $y_i$ and
$\hat\theta$.

Similarly, we conjecture that

$$E\bigl[(\tilde y_1 - \hat f^{-1}(\tilde x_1, \hat\theta))^2\bigr] = E\bigl[(\tilde y_1 - \hat f(\tilde x_1, \hat\theta))^2\bigr], \tag{7}$$

because there is no signal. If we could show (6) and (7), then we would have
$E[\mathrm{CV}(\hat\theta)] \le E[(\tilde y_1 - \hat f(\tilde x_1, \hat\theta))^2] = \mathrm{Err}$.

Table 2

Results for KNN and GBM when $p \gg n$, with 5-fold cross-validation

Classifier   Setting     Min CV error     Test error       Adjusted CV error

KNN          No signal   0.430 (0.007)    0.5              0.524 (0.009)
KNN          Signal      0.213 (0.007)    0.253 (0.005)    0.281 (0.009)
GBM          No signal   0.425 (0.008)    0.5              0.511 (0.010)
GBM          Signal      0.265 (0.008)    0.289 (0.007)    0.325 (0.010)

6. Discussion. We have proposed a simple estimate of the bias of the 
minimum error rate in cross-validation. It is easy to compute, requiring es- 
sentially no additional computation after the initial cross-validation. Our 
studies indicate that it is reasonably accurate in general. We also found that 
the bias itself is only an issue when $p \gg n$, and its magnitude varies
considerably depending on the classifier. For this reason, it can be misleading to
compare the CV error rates when choosing between models (e.g., choosing 
between NSC and SVM); in this situation the bias estimate is very impor- 
tant. 